Abstract

LSTD is numerically unstable for some ergodic Markov chains in which some states are visited much more often than the remaining ones, because the matrix that LSTD accumulates has a large condition number. In this paper, we propose a variant of temporal difference learning with high data efficiency. A class of preconditioned temporal difference learning algorithms is also proposed to speed up the new method. The class includes LSPE and several new data-efficient algorithms. The data efficiency of these algorithms is validated by learning an absorbing Markov chain. The asymptotic properties of the new algorithms are also analyzed.

1. Introduction 

Recently, the data efficiency and complexity of reinforcement learning algorithms have attracted much attention. The central issue is the tradeoff between the two. To achieve fast real-time performance, data should be forgotten after each transition; conversely, to make full use of data, one has to sacrifice some real-time performance, which increases the computation per time step. The former is the way of TD [Sutton88], and the latter is the way of least-squares temporal difference learning (LSTD) [Bradtke96] [Boyan99] [Boyan02]. Several algorithms strive for a compromise between the two properties, such as prioritized sweeping [Moore93], recursive least-squares temporal difference learning (RLS-TD) [Bradtke96] [Xu02recursive], least-squares policy evaluation (LSPE) [Bertsekas96] [Nedic03] [Bertsekas04] and incremental least-squares temporal difference learning (iLSTD) [Geramifard06].

This paper also aims to enhance the data efficiency of TD methods. It is organized as follows. In the remainder of this section, we introduce some notation. In section 2 we show that LSTD may be numerically unstable. A model-based algorithm is given in section 3. In section 4 the preconditioned temporal difference learning algorithms are introduced. Their data efficiency is discussed in section 5. Finally, their asymptotic properties are presented in section 6.

1.1. Notations 

For discounted Markov chains, we need the following notation. $\alpha \in [0, 1]$ is the discount factor. $S = \{1, 2, \ldots, N\}$ is the state set, where $N$ is a finite or infinite positive integer. $P = [P_{i,j}]_{N \times N}$ is the transition probability matrix, and $I$ is the $N \times N$ identity matrix. $g(x_t, x_{t+1})$ is a stochastic scalar reward on the transition from state $x_t$ to state $x_{t+1}$. $[\bar g]_{N \times 1}$ is the vector whose $i$th component is $\sum_j P_{i,j}\, g(i, j)$. The steady-state vector is $\pi = [\pi(1), \pi(2), \ldots, \pi(N)]'$, where $'$ is the transpose operator. The matrix $\mathrm{diag}(X)$ is the diagonal matrix of $X$.

We need some more notation to work with linear function approximation. $\phi_k(\cdot): S \to \mathbb{R}$, $k = 1, \ldots, K$, is a family of independent basis functions; $[\Phi]_{N \times K}$ is the matrix formed by taking $\phi_k(\cdot)$ as its $k$th column. $[w]_{K \times 1}$ is the weight vector to be learned. $V = \Phi w$ is the approximated value of the optimal state value vector, or cost-to-go vector, of the Markov chain.

The eligibility trace is widely used by a class of algorithms including TD [Sutton88], LSPE [Bertsekas96][Nedic03][Bertsekas04], LSTD [Boyan02], RLS-TD [Bradtke96] [Xu02recursive] and iLSTD [Geramifard06]. Let $\lambda$ be its factor. Given an ergodic Markov chain, we denote the steady-state distribution vector as $d = (d(1), d(2), \ldots, d(N))^T$. $D$ is a diagonal matrix whose $i$th diagonal entry is equal to $d(i)$. The common aim of these algorithms is to iteratively compute the root of $f(\Phi, P, g, w) = Aw + b$, where

$$A = \Phi' D (M - I) \Phi, \tag{1}$$

and

$$b = \Phi' D q, \tag{2}$$

with $M = (1 - \lambda)\alpha (I - \lambda\alpha P)^{-1} P$, and $q = (I - \lambda\alpha P)^{-1} \bar g$.

2. The numerical instability of LSTD 

A discounted Markov chain is shown in Figure 1. There are three states. As long as ε ∈ (0, 1) and 0 < ε + / < 1, the chain is ergodic. When ε is relatively small, transitions between state 1 and state 2 are much more frequent than those between state 3 and either state 1 or state 2.



Figure 1. A Markov chain with preferred visits between state 1 and state 2. The transition probability is marked on each arc.



We plot the condition number of LSTD's matrix using the 2-norm. As shown in Figure 2, LSTD suffers from large condition numbers of its matrix on this problem. The simulation uses ε = 0.02 and / = 0.05. The state is represented by a lookup table. Each point is the condition number of the matrix averaged over 20 trajectories. This indicates that LSTD may be numerically unstable for similar Markov chains in which one subgroup of states is visited more often than the others. If there is a small perturbation ε in the coefficient matrix or coefficient vector of LSTD, the relative error of the solution can be as large as κε [Golub96], where κ is the condition number of LSTD's matrix. LSTD becomes numerically unstable exactly when the condition number is large. Figure 3 shows the synchronization of the two events for a typical trajectory.
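The effect of rarely visited states on the condition number can be illustrated with a minimal sketch. It does not reproduce the chain of Figure 1, whose transition probabilities are not restated here; the 3-state chain, the discount factor 0.9 and the helper names below are assumptions made only for illustration. The sketch builds the lookup-table limit $D(\alpha P - I)$ of LSTD's matrix for $\lambda = 0$ and compares its 2-norm condition number for a rare and a less rare third state.

```python
import numpy as np

def stationary_distribution(P):
    """Solve d' P = d' together with sum(d) = 1 by least squares."""
    n = P.shape[0]
    M = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    d, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return d

def lstd_limit_matrix(eps, alpha=0.9):
    """Lookup-table limit A = D (alpha P - I) for a hypothetical 3-state chain.

    States 1 and 2 mostly visit each other; state 3 is entered with
    probability eps only (an assumed structure, not the paper's Figure 1).
    """
    P = np.array([[0.0, 1.0 - eps, eps],
                  [1.0 - eps, 0.0, eps],
                  [0.5, 0.5, 0.0]])
    D = np.diag(stationary_distribution(P))
    return D @ (alpha * P - np.eye(3))

# 2-norm condition numbers (np.linalg.cond uses the 2-norm by default).
cond_rare = np.linalg.cond(lstd_limit_matrix(eps=0.001))
cond_common = np.linalg.cond(lstd_limit_matrix(eps=0.1))
print(cond_rare, cond_common)
```

In this sketch the row of the matrix that belongs to the rarely visited state is scaled by a small stationary probability, which is what drives the condition number up.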

3. The algorithm of bAr 

For an observation of a new transition $x_t \to x_{t+1}$ and an immediate reward $g(x_t, x_{t+1})$, we compute

$$\tilde b_t = z_t\, g(x_t, x_{t+1}),$$

and

$$\tilde A_t = z_t \left( \alpha \phi(x_{t+1}) - \phi(x_t) \right)'$$

with the old eligibility trace vector $z_t$. Then, TD learning can be formulated as follows [Tsitsiklis97]:

$$r_{t+1} = r_t + \gamma_t (\tilde A_t r_t + \tilde b_t). \tag{3}$$

The eligibility trace is updated by

$$z_{t+1} = \lambda\alpha z_t + \phi(x_{t+1}). \tag{4}$$

Figure 2. The condition number of LSTD's matrix for solving the problem in Figure 1. exp(·) denotes the natural exponential function; cond(X) denotes the condition number of the matrix of LSTD(λ).

The matrix-form variant (3) generates exactly the same predictions as the original TD methods [Sutton88], which learn according to the temporal difference signal and the eligibility trace vector. The variant is not computationally economical compared with the original TD. However, it sheds light on two facts: the incoming information of vector $\tilde b_t$ and matrix $\tilde A_t$ is independent of the parameter $r_t$; and, except for the pre-specified learning rate $\gamma_t$, $\tilde b_t$ and $\tilde A_t$ determine everything about the learning dynamics of $r_t$. This suggests that the success of the TD methods may lie in $\tilde b_t$ and $\tilde A_t$, which carry the valuable information on the transition $x_t \to x_{t+1}$.
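The equivalence between the matrix-form variant (3) and the usual TD(λ) update can be checked numerically. The following is a minimal sketch under assumed inputs: random feature vectors and rewards stand in for a real trajectory, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 4, 100                      # feature dimension, number of transitions
alpha, lam, gamma = 0.9, 0.5, 0.01

features = rng.normal(size=(T + 1, K))   # stand-ins for phi(x_0), ..., phi(x_T)
rewards = rng.normal(size=T)             # stand-ins for g(x_t, x_{t+1})

r_td = np.zeros(K)     # weights under the classical TD(lambda) update
r_mat = np.zeros(K)    # weights under the matrix-form update (3)
z = features[0].copy() # eligibility trace z_0 = phi(x_0)

for t in range(T):
    phi, phi_next, g = features[t], features[t + 1], rewards[t]

    # Classical form: temporal difference signal times the trace.
    delta = g + alpha * phi_next @ r_td - phi @ r_td
    r_td = r_td + gamma * delta * z

    # Matrix form: one-step vector and matrix, then update (3).
    b_step = z * g
    A_step = np.outer(z, alpha * phi_next - phi)
    r_mat = r_mat + gamma * (A_step @ r_mat + b_step)

    # Trace update (4), shared by both forms.
    z = lam * alpha * z + phi_next

print(np.max(np.abs(r_td - r_mat)))
```

Both loops perform the same arithmetic in a different grouping, so the two weight vectors agree up to floating-point rounding; the matrix form costs an outer product per step, which is why it is not computationally economical.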

On the other hand, the inefficient use of data by the TD methods can also be observed from this matrix-form variant: each pair $\tilde b_t$ and $\tilde A_t$ is used for a single update of $r_t$ and then discarded. Hence, we aim at improving the data efficiency of $r_t$ by improving the data efficiency of the coefficient parameters: the vector and matrix that are used to update $r$. In this way, it is expected that the data efficiency of TD methods would be improved.

A natural idea is to extract more information from $\tilde b_t$ and $\tilde A_t$. In fact, through a sequence of the temporal values of $\tilde b_t$ and $\tilde A_t$, the well-defined but unseen vector $b$ and matrix $A$ in (2) and (1) can be gradually unveiled. Denoting $b_t$ and $A_t$ respectively as the estimates of $b$ and $A$ at step $t$, we can move the old estimates $b_{t-1}$ and $A_{t-1}$ towards the new observations $\tilde b_t$ and $\tilde A_t$ in the following way:

$$b_t = b_{t-1} + \gamma_t (\tilde b_t - b_{t-1}), \tag{5}$$

and

$$A_t = A_{t-1} + \gamma_t (\tilde A_t - A_{t-1}). \tag{6}$$

The two are the well-known Robbins-Monro algorithms and were first proposed in [Nedic03]. With the standard rule for the learning rate,

$$\gamma_t > 0, \quad \sum_{t=0}^{\infty} \gamma_t = \infty, \quad \sum_{t=0}^{\infty} \gamma_t^2 < \infty,$$

and certain conditions on the Markov chains and basis functions [Tsitsiklis97][Tadic01][Nedic03], $b_t$ and $A_t$ would be consistent, unbiased, and maximum likelihood estimates of $b$ and $A$ respectively. The fact that $A_t$ and $b_t$ asymptotically converge to $A$ and $b$ respectively was explored by [Bradtke96][Tadic01][Bertsekas04]. [Tadic01] pointed out that the underlying guarantee is the law of large numbers. Since the law can be generalized from the "frequency-probability" view, interpreting (5) and (6) in a "frequency-probability" manner makes the understanding more intuitive. For simplicity, we focus on the lookup table representation.

Figure 3. The prediction of LSTD and the condition number of LSTD's matrix.
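The "frequency-probability" reading of the Robbins-Monro recursions can be made concrete with a minimal sketch: with the learning rate $1/t$, the recursive estimates are exactly the sample means of the one-step observations. Random arrays stand in for the one-step vectors and matrices here; they and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 3, 200
b_obs = rng.normal(size=(T, K))      # stand-ins for the one-step vectors
A_obs = rng.normal(size=(T, K, K))   # stand-ins for the one-step matrices

b_est = np.zeros(K)
A_est = np.zeros((K, K))
for t in range(1, T + 1):
    step = 1.0 / t                           # learning rate 1/t
    b_est += step * (b_obs[t - 1] - b_est)   # recursion of the form (5)
    A_est += step * (A_obs[t - 1] - A_est)   # recursion of the form (6)

# With step size 1/t the recursion reproduces the running sample mean.
print(np.max(np.abs(b_est - b_obs.mean(axis=0))))
```

Because the estimates are plain averages, the law of large numbers applies directly, which is the guarantee referred to above.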



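Choose $1/t$ as the learning rate in (5) and (6). First we break $A_t$ into $A_t^{(1)}$ and $A_t^{(2)}$, where

$$A_t^{(1)} = \frac{1}{t} \sum_{k=0}^{t} z_k \phi(x_k)' = \frac{1}{t} \sum_{k=0}^{t} \sum_{j=0}^{k} (\lambda\alpha)^{k-j} \phi(x_j) \phi(x_k)', \tag{7}$$

and

$$A_t^{(2)} = \frac{1}{t} \sum_{k=0}^{t} z_k \phi(x_{k+1})', \tag{8}$$

so that $A_t = -A_t^{(1)} + \alpha A_t^{(2)}$. Let us first consider $\lambda = \alpha = 1$. In this case,

$$A_t^{(1)} = \frac{1}{t} \sum_{j=0}^{t} \phi(x_j) \sum_{k=j}^{t} \phi(x_k)'. \tag{9}$$

Let $n_{x_{k_0}, l}(k_0, t)$ be the number of times that state $l$ is seen in the sub-trajectory $x_{k_0}, \ldots, x_t$, which starts from state $x_{k_0}$. For example, if the full trajectory is $x_k = 1, 2, 2, 1, 2, 3, 1$, where $0 \le k \le t\ (= 6)$, we have $n_{1,2}(0, 6) = 3$ and $n_{1,2}(3, 6) = 1$. Denoting $[a(i, l)]_{il}$ as the matrix whose $(i, l)$th entry is $a(i, l)$, we can rewrite equation (9) as:

$$A_t^{(1)} = \frac{1}{t} \sum_{j=0}^{t} [n_{il}(j, t)]_{il}, \tag{11}$$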
where $i = x_j$ and $l \in S$. Let $f_{il}(t)$ be the number of times that state $i$ visits state $l$ till step $t$ (which means there is a route from $i$ to $l$). One can verify that

$$\sum_{j=0}^{t} n_{il}(j, t) = f_{il}(t). \tag{12}$$

Next we classify the different types of visits from $i$ to $l$. Let $f_{il}^{(a)}(t)$ be the number of times that state $i$ visits state $l$ after $a$ steps till time $t$, and $f_i(t)$ the number of times that state $i$ has been visited till step $t$. Also define $f_{ii}^{(0)}(t) = f_i(t)$. According to (11) and (12), we have

$$\frac{1}{t} f_{il}(t) = \frac{f_i(t)}{t} \cdot \frac{f_{il}^{(0)}(t) + f_{il}^{(1)}(t) + \cdots + f_{il}^{(t)}(t)}{f_i(t)} \xrightarrow{\text{w.p.1}} \pi(i) \left( \delta_{il} + P_{il} + P^2_{il} + \cdots \right), \tag{13}$$

where $\delta_{il}$ is the Kronecker delta.

For $\lambda \ne 1$ or $\alpha \ne 1$, in (7), the parameter decays the visits originating from state $x_j$ to state $x_k$ by looking $k - j$ steps into the history. Therefore, $A_t^{(1)} \to D \sum_{k=0}^{\infty} (\alpha\lambda P)^k$ with probability one. Similarly, $A_t^{(2)} \to D P \sum_{k=0}^{\infty} (\lambda\alpha P)^k$ with probability one.

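The analysis of $b_t$ is similar. For $\alpha\lambda \in [0, 1]$, we have

$$b_t = \frac{1}{t} \sum_{k=0}^{t} \sum_{j=0}^{k} (\alpha\lambda)^{k-j} \phi(x_j)\, g(x_k, x_{k+1}),$$

whose $l$th component is

$$\frac{1}{t} \sum_{s \in S} F_{ls}(t, \alpha\lambda)\, g(l, s),$$

where $F_{ls}(t, \alpha\lambda) = f_{ls}^{(1)}(t) + \cdots + (\alpha\lambda)^{t} f_{ls}^{(t+1)}(t)$.

With similar arguments to that of $A_t^{(1)}$, we have

$$\frac{f_l(t)}{t} \sum_{s \in S} \frac{F_{ls}(t, \alpha\lambda)}{f_l(t)}\, g(l, s) \xrightarrow{\text{w.p.1}} \pi(l) \sum_{s \in S} \left( P_{ls} + (\alpha\lambda) P^2_{ls} + \cdots \right) g(l, s) = \pi(l) \sum_{s \in S} \sum_{t=0}^{\infty} (\alpha\lambda)^t P^{t+1}_{ls}\, g(l, s). \tag{14}$$

Denote $X(l, :)$ as the $l$th row of matrix $X$, and $X(:, s)$ as $X$'s $s$th column. We can rewrite (14) as:

$$\pi(l) \sum_{s \in S} \sum_{t=0}^{\infty} (\alpha\lambda)^t P^{t+1}_{ls}\, g(l, s) = \pi(l) \sum_{t=0}^{\infty} (\alpha\lambda)^t P^t(l, :) \sum_{s \in S} P(:, s)\, g(l, s) = \pi(l) \sum_{t=0}^{\infty} (\alpha\lambda)^t P^t(l, :)\, \bar g.$$

Therefore,

$$b_t \xrightarrow{\text{w.p.1}} \left[ D \sum_{t=0}^{\infty} (\alpha\lambda P)^t \right] \bar g. \tag{15}$$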
Since

$$-A_t^{(1)} + \alpha A_t^{(2)} \xrightarrow{\text{w.p.1}} \left[ D \sum_{k=0}^{\infty} (\alpha\lambda P)^k \right] (\alpha P - I), \tag{16}$$

it is interesting that (15) and (16) have a common part, as bracketed in the two equations. The common part indicates that $b_t$ and $A_t$ asymptotically accumulate some mutual information that can be cancelled. In fact, this is the key to guarantee that TD, LSTD, RLS-TD, LSPE and the new algorithms below will converge asymptotically to the optimal value, $r^* = (I - \alpha P)^{-1} \bar g$.

After the learning sub-steps of $b_t$ and $A_t$, we can learn the parameter $r_{t+1}$ according to:

$$r_{t+1} = r_t + \gamma (A_t r_t + b_t), \tag{17}$$

where $\gamma$ is a constant learning rate. Finally, the eligibility trace is updated according to (4).

The learning equation (17) differs from TD in two aspects: $\gamma$ is not a diminishing learning rate; and the estimates $b_t$ and $A_t$ replace the one-step observations $\tilde b_t$ and $\tilde A_t$ respectively. The new algorithm bases the learning of $r$ on the accuracy of the estimated values of $b$ and $A$. So we call it bAr.
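The three sub-steps of bAr (estimate $b$ and $A$, update $r$ with a constant learning rate, update the trace) can be put together in a minimal sketch. It assumes a lookup-table representation, $\lambda = 0$, $\gamma = 1$, and a small hypothetical 3-state chain with deterministic rewards; the chain, the rewards and the function name `run_bar` are illustrative and are not taken from the experiments of this paper.

```python
import numpy as np

def run_bar(P, G, alpha=0.9, lam=0.0, gamma=1.0, steps=20000, seed=0):
    """Sketch of bAr with a lookup table: running averages of the one-step
    quantities, followed by a constant-step update of r."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    Phi = np.eye(n)                      # lookup-table features
    A_est = np.zeros((n, n))
    b_est = np.zeros(n)
    r = np.zeros(n)
    x = 0
    z = Phi[x].copy()                    # eligibility trace
    for t in range(1, steps + 1):
        x_next = rng.choice(n, p=P[x])   # sample the next state
        A_step = np.outer(z, alpha * Phi[x_next] - Phi[x])
        b_step = z * G[x, x_next]
        A_est += (A_step - A_est) / t    # estimate of A, learning rate 1/t
        b_est += (b_step - b_est) / t    # estimate of b, learning rate 1/t
        r = r + gamma * (A_est @ r + b_est)   # constant-step update (17)
        z = lam * alpha * z + Phi[x_next]     # trace update (4)
        x = x_next
    return r

# Hypothetical chain and rewards g(i, j), chosen only for illustration.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
G = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
alpha = 0.9
g_bar = (P * G).sum(axis=1)                       # expected one-step reward
r_star = np.linalg.solve(np.eye(3) - alpha * P, g_bar)
r_bar = run_bar(P, G, alpha=alpha)
print(r_bar, r_star)
```

On this toy chain the learned weights end up near $(I - \alpha P)^{-1} \bar g$; the remaining gap reflects the sampling error in the estimates of $b$ and $A$ rather than the update rule.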



4. The preconditioned temporal difference learning 



In this section, we show that bAr can be accelerated by a technique called "preconditioning" [Golub96]. In practice, an equivalent of $f = 0$, i.e., a preconditioned system

$$\mathcal{C}^{-1} A x = -\mathcal{C}^{-1} b \tag{18}$$

is often preferred, where $\mathcal{C}$ is called the preconditioner. The process of transforming $f = 0$ into the equivalent system (18) is called "preconditioning". It is hoped that the preconditioned system possesses more favorable properties for iterative computation. (18) can be iteratively solved by:

$$r_{\tau+1} = \mathcal{C}^{-1} (-A + \mathcal{C})\, r_\tau - \mathcal{C}^{-1} b. \tag{19}$$
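The iteration above can be sketched for a known model. The example below is a minimal illustration, assuming a hypothetical 3-state chain with $A = \alpha P - I$ and $b = \bar g$, so that $Ar + b = 0$ is the Bellman equation; the function name `preconditioned_iteration` and the numbers are illustrative. Two preconditioners are tried: $-I$ and the diagonal of $A$.

```python
import numpy as np

def preconditioned_iteration(A, b, C, r0, iters=200):
    """Iterate r <- C^{-1} (-A + C) r - C^{-1} b, as in equation (19)."""
    r = r0.copy()
    for _ in range(iters):
        r = np.linalg.solve(C, (-A + C) @ r - b)
    return r

alpha = 0.9
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])          # hypothetical transition matrix
g_bar = np.array([1.2, 0.8, 1.3])        # hypothetical expected rewards
A = alpha * P - np.eye(3)
b = g_bar
r_exact = -np.linalg.solve(A, b)         # root of A r + b = 0

r_minus_identity = preconditioned_iteration(A, b, -np.eye(3), np.zeros(3))
r_diagonal = preconditioned_iteration(A, b, np.diag(np.diag(A)), np.zeros(3))
print(r_minus_identity, r_diagonal, r_exact)
```

With $-I$ the iteration matrix is $\alpha P$, a contraction in the maximum norm; with the diagonal of $A$ the iteration is the classical Jacobi method. Both reach the same root, at different speeds.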



In the domain of reinforcement learning, $f$ is not known. But based on the idea of bAr, $f = 0$ can be approximated by the stochastic system $A_t r_t = -b_t$, and we can apply some stochastic preconditioner to precondition it. Therefore, the corresponding stochastic counterparts of (18) and (19) are

$$\mathcal{C}_t^{-1} A_t x = -\mathcal{C}_t^{-1} b_t, \tag{20}$$

and

$$r_{t+1} = r_t - \gamma\, \mathcal{C}_t^{-1} (A_t r_t + b_t), \tag{21}$$

where $\gamma$ is the step-size. The system (20) and iteration (21) differ from (18) and (19) in that the vector $b_t$ and the matrices $A_t$ and $\mathcal{C}_t$ are stochastic. But each of them has a well-defined asymptotic limit. Therefore, we call (20) the approximate preconditioned system of $f = 0$, and (21) the preconditioned temporal difference learning for solving the cost-to-go vector prediction problem. bAr ($\gamma = 1$) is a special preconditioned temporal difference learning algorithm obtained by choosing $-I$ as the preconditioner. The approximate preconditioned algorithm stands for a wide class of algorithms, as demonstrated in the following subsections.

For the purpose of preconditioning, the approximate preconditioner $\mathcal{C}_t$ is required to have three properties: first, $\mathcal{C}_t$ should be invertible with probability one; second, $\mathcal{C}_t$ should be economical to construct and apply; third, the preconditioned system (20) can be asymptotically solved. The last property means that the spectral radius of $\mathcal{C}_t^{-1} (-A_t + \mathcal{C}_t)$ should asymptotically fall into $[0, 1)$.

4.0.1. Extended LSTD

Choosing $A_t$ as the preconditioner, according to (21), we have

$$r_{t+1} = r_t - \gamma A_t^{-1} (A_t r_t + b_t) = (1 - \gamma)\, r_t + \gamma \left( -A_t^{-1} b_t \right). \tag{22}$$

If we set $\gamma = 1$, equation (22) is exactly LSTD; a value of $\gamma \in (0, 1)$ calculates a linearly weighted sum of the former estimate $r_t$ and LSTD's estimate at the current step. Equation (22) is more general than LSTD, and we call it extended LSTD (ELSTD).

In our experiments with ELSTD, some intermediate value $\gamma_0 \in (0, 1)$ produces a more accurate solution than LSTD. The reasons are as follows. First, in the initial stage of learning, where the estimated model $(A_t, b_t)$ is not accurate enough, the weighted value does not require that $r_t$ strictly satisfy the model equation, and is thus more reasonable than LSTD's estimate. Second, after LSTD's estimates become reliable, the history of LSTD's estimates can be reused to accelerate LSTD. The reuse of history follows the intuition that more recent estimates are more reliable than those made in the distant past. In particular, ELSTD calculates an exponentially weighted sum of LSTD's estimates over time $t$. According to (22), substituting recursively for $r_t, r_{t-1}, \ldots, r_1$, we have

$$r_{t+1} = (1 - \gamma)^{t+1} r_0 + \gamma (1 - \gamma)^{t} r_0^{LSTD} + \cdots + \gamma (1 - \gamma)\, r_{t-1}^{LSTD} + \gamma\, r_t^{LSTD},$$

where $r_k^{LSTD} = -A_k^{-1} b_k$ is LSTD's estimate at step $k$.

The technique is widely used in many areas such as numerical analysis, neural networks (generalized delta rule [Haykin94], page 170), and time series forecasting (exponential smoothing [Bowerman93], chapter 8).
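The exponentially weighted form of ELSTD can be verified with a minimal sketch. Random vectors stand in for the sequence of LSTD estimates; they, the step-size 0.3 and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, T, K = 0.3, 30, 4
lstd_estimates = rng.normal(size=(T + 1, K))   # stand-ins for LSTD's estimates
r0 = rng.normal(size=K)

# Recursive form: blend the former estimate with the current LSTD estimate.
r = r0.copy()
for k in range(T + 1):
    r = (1 - gamma) * r + gamma * lstd_estimates[k]

# Unrolled form: exponentially decaying weights on older LSTD estimates.
weights = gamma * (1 - gamma) ** (T - np.arange(T + 1))
r_unrolled = (1 - gamma) ** (T + 1) * r0 + weights @ lstd_estimates

print(np.max(np.abs(r - r_unrolled)))
```

The weights and the factor on the initial value sum to one, so the ELSTD iterate is a convex combination of the initial weight vector and the LSTD estimates seen so far.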

4.0.2. LSPE

LSPE is a special preconditioned temporal difference (PTD) learning algorithm. $-D_t$ can be chosen as the preconditioner, and a preconditioned model equation $-D_t^{-1} A_t w = D_t^{-1} b_t$ is obtained. The induced PTD update at time $t$ is

$$r_{t+1} = r_t + \gamma\, (t D_t)^{-1} \left( (t A_t)\, r_t + t\, b_t \right), \tag{23}$$

which is exactly LSPE. We can rewrite LSPE as

$$r_{t+1} = r_t + \gamma D_t^{-1} (A_t r_t + b_t) = r_t + \gamma \left( T_t^{(\lambda)} r_t - r_t \right) = (1 - \gamma)\, r_t + \gamma\, T_t^{(\lambda)} r_t, \tag{24}$$

where the second equation follows according to Theorem 1 and the definition of the temporal TD operator $T_t^{(\lambda)}$ in (27). In (24), if $\gamma = 1$, we have

$$r_{t+1} = T_t^{(\lambda)} r_t, \tag{25}$$

i.e., LSPE using 1 as the step-size can be seen as directly applying the temporal TD operator. While applying the TD operator is not a practical algorithm, applying the temporal TD operator provides a practical way to learn $w$. For $\gamma_0 \in (0, 1)$, LSPE using $\gamma_0$ has a form similar to that of ELSTD. In particular, directly applying the temporal TD operator gives a history of estimates; LSPE calculates an exponentially weighted sum of these estimates according to their recency. This gives a new understanding of the LSPE algorithm.

4.1. The approximate preconditioned algorithms based on partitioning 

As seen in the previous section, LSPE ($\gamma = 1$) is the preconditioning technique choosing $-D_t$ as the approximate preconditioner. Several classical methods are based on partitioning $A_t$ into $-\mathcal{D}_t$, $-\mathcal{L}_t$ and $-\mathcal{F}_t$, where $-\mathcal{D}_t$ is the diagonal matrix of $A_t$, $-\mathcal{L}_t$ is the strictly lower triangular matrix and $-\mathcal{F}_t$ is the strictly upper triangular matrix of $A_t$.

We could use the following approximate preconditioners:

$$\mathcal{C}_t^{Jac} = -\mathcal{D}_t, \quad \mathcal{C}_t^{GS} = -\mathcal{D}_t - \mathcal{L}_t, \quad \mathcal{C}_t^{SOR} = \frac{1}{\omega} \left( -\mathcal{D}_t - \omega \mathcal{L}_t \right).$$

The approximate Jacobi iteration, the approximate Gauss-Seidel iteration, and the approximate Successive Over-Relaxation (SOR) iteration are obtained by putting $\mathcal{C}_t^{Jac}$, $\mathcal{C}_t^{GS}$ and $\mathcal{C}_t^{SOR}$ into (21) respectively.
To get some understanding, for simplicity, we consider only $\lambda = 0$ and the lookup table representation below. Let $Z_t = \mathcal{C}_t^{-1} (-A_t + \mathcal{C}_t)$ and $G = \mathrm{diag}(P)$; we have:

$$Z_t^{LSPE} = I + D_t^{-1} A_t \xrightarrow{\text{w.p.1}} I + D^{-1} D (\alpha P - I) = \alpha P \overset{\text{def}}{=} Z^{LSPE},$$

and

$$Z_t^{Jac} = I + \mathcal{D}_t^{-1} A_t \xrightarrow{\text{w.p.1}} I + (D - \alpha D G)^{-1} A = I - (I - \alpha G)^{-1} (I - \alpha P) \overset{\text{def}}{=} Z^{Jac}.$$

If the chain has no self-transitions, approximate Jacobi iteration is the same as LSPE ($\gamma = 1$). But $\mathcal{D}_t$ equals $D_t$ only when there are no self-transitions in the chain. Therefore, approximate Jacobi iteration and LSPE ($\gamma = 1$) are generally different.

Let $(X)_L$ be the lower triangular matrix of matrix $X$. Then

$$Z_t^{GS} = I + (\mathcal{D}_t + \mathcal{L}_t)^{-1} A_t \xrightarrow{\text{w.p.1}} I + \{ (D (I - \alpha P))_L \}^{-1} A = I + \{ D\, (I - \alpha P)_L \}^{-1} A = I - \{ (I - \alpha P)_L \}^{-1} (I - \alpha P) \overset{\text{def}}{=} Z^{GS}.$$

The derivation of $Z^{SOR}$ is similar to that of $Z^{GS}$. Now the relations among LSPE, approximate Jacobi iteration, approximate Gauss-Seidel and approximate SOR are clear. If the model is known, one can use the following iteration,

$$r_{\tau+1} = \mathcal{C}^{-1} (I - \alpha P + \mathcal{C})\, r_\tau - \mathcal{C}^{-1} \bar g, \tag{26}$$

to solve the deterministic system $(\alpha P - I)\, r^* = -\bar g$. In (26), if $\mathcal{C}$ is replaced respectively by $-I$, $-(I - \alpha G)$, $-(I - \alpha P)_L$ and $(-(I - \alpha G) - \omega (I - \alpha P)_{\mathcal{L}})/\omega$, where $(X)_{\mathcal{L}}$ is the strictly lower triangular matrix of $X$, we get exactly Richardson's iteration (with 1 as the learning rate), Jacobi iteration, Gauss-Seidel and SOR, respectively, for solving the deterministic system, using $Z^{LSPE}$, $Z^{Jac}$, $Z^{GS}$ and $Z^{SOR}$ respectively as the iteration matrix.

5. Improved data efficiency 

In this section, we compare the class of approximate preconditioned algorithms with TD and LSTD on an absorbing example modified from Boyan's [Boyan02]. For infinite horizon problems, one may refer to [Bertsekas04].
The transition probability matrix used is



Figure 4. An example modified from Boyan's. 
and the cost is $g_{ij} = 1$ ($j = 6$), or —Q ($j \ne 6$).

We focus on $\lambda = 0$. The lookup table representation is used because it enables accurate solutions. Figure 5 shows the prediction error averaged over 20 trajectories. TD(1/t) uses a learning rate of $1/t$, where $t$ is the trajectory number. As shown in the figure, all the approximate preconditioned algorithms exhibit higher data efficiency than TD. It should also be noted that TD's learning curves in Figure 5 and in the infinite horizon problems [Bertsekas04] both have an initial period of fluctuations because of TD's stochastic nature. The noise is gradually restrained by the diminishing learning rate. However, besides consuming some time before TD predicts reliably, the diminishing learning rate may also make TD methods slow down and stop learning too early. As in Figure 5, after 20 steps, TD(1/t) almost stops. For this reason, $\gamma_t(c, T_0) = c\, \frac{T_0 + 1}{T_0 + t}$ is often used [Bertsekas96][Boyan02][Bertsekas04][Geramifard06]. The rule aims to seek a balance between restraining noise and keeping the learning going. This learning rule, however, adds much labor in selecting the parameters $T_0$ and $c$, requiring $\frac{T_2 - T_1}{\Delta T_0} \cdot \frac{c_2 - c_1}{\Delta c}$ times the complexity of individual TD learning, where $[T_1, T_2]$ and $[c_1, c_2]$ are the pre-selected intervals, and $\Delta T_0$ and $\Delta c$ are the step-sizes of finding the best $T_0$ and $c$. Let $T = \frac{T_0}{c (T_0 + 1) - 1}$; we have $\gamma_t(c, T_0) < 1/t$ (if $t < T$), and $\gamma_t(c, T_0) > 1/t$ (if $t > T$). For example, with the schedule $\gamma_t$ used in our experiment, as shown in Figure 5, exactly at $t = T = 11$ there is an intersection of TD(1/t) and TD($\gamma_t$). Before 11 steps, the variation of TD($\gamma_t$) is smaller than that of TD(1/t); but after 11 steps, the variations of TD($\gamma_t$) are bigger.

Figure 5. Comparisons of different algorithms. All algorithms start from the same initial weight.
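The crossing point between the two learning-rate rules can be checked with a minimal sketch. The values $c = 0.1$ and $T_0 = 100$ below are hypothetical, chosen only so that the crossing falls near $t = 11$; they are not claimed to be the parameters used in the experiment, and the function name `gamma_schedule` is illustrative.

```python
import numpy as np

def gamma_schedule(t, c, T0):
    """Learning rate c (T0 + 1) / (T0 + t)."""
    return c * (T0 + 1.0) / (T0 + t)

c, T0 = 0.1, 100.0                       # hypothetical parameters
T_cross = T0 / (c * (T0 + 1.0) - 1.0)    # solves gamma_schedule(T) = 1 / T

t = np.arange(1, 31)
below = gamma_schedule(t, c, T0) < 1.0 / t   # True where the rule is below 1/t
print(T_cross, below)
```

Before the crossing the rule takes smaller steps than $1/t$, which restrains the early noise; after the crossing it takes larger steps, which keeps the learning going.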

LSTD is numerically stable on this example and gives the best prediction. But its advantage over approximate Jacobi iteration, LSPE, approximate Gauss-Seidel iteration and approximate SOR is not significant. Approximate Gauss-Seidel and approximate SOR even approach the solutions of LSTD in no more than 5 steps. Given that one cannot know beforehand whether a problem is similar to that presented in section 2, on which LSTD is not numerically stable, the approximate preconditioned algorithms may be more reliable.



6. Asymptotic properties of the approximate preconditioned algorithms 
6.1. The temporal TD operator 

In the learning processes of bAr and LSPE ($\gamma = 1$) there is a very important operator. The operator is temporal, so it is called the temporal TD operator. $\Phi$ is also combined into the temporal TD operator. As a result, the temporal TD operator operates in $\mathbb{R}^K$ while the TD operator operates in $\mathbb{R}^N$. Thus, the temporal TD operator allows function approximation. It is given by the following definition.

Definition 1 (Temporal TD operator). Under Assumptions 1, 2 and 3 given in the appendix, let $H_t = A_t + D_t$. If $D_t$ is invertible, let $g_t = D_t^{-1} b_t$. Then for $\lambda \in [0, 1]$ and $w \in \mathbb{R}^K$,

$$T_t^{(\lambda)} w = D_t^{-1} (H_t w + b_t) \tag{27}$$

is well-defined for any $t$ such that $D_t$ is invertible.



The following relation is true according to (15), (16) and the definition of $T^{(\lambda)}$ in [Tsitsiklis97].

Theorem 1. In the lookup table representation, for $\forall J \in \mathbb{R}^N$,

$$T_t^{(\lambda)} J \to T^{(\lambda)} J, \quad \text{w.p.1}.$$

The temporal TD operator is visible in the operations of bAr and LSPE ($\gamma = 1$). As an intuitive understanding, in the lookup table representation, bAr can be rewritten in the following form using the temporal TD operator:

$$r_{t+1} = r_t + \left( -A_t^{(1)} + \alpha A_t^{(2)} \right) r_t + b_t = r_t + D_t \left( T_t^{(\lambda)} r_t - r_t \right),$$

where the second equation follows according to (15), (16) and (27). According to Theorem 1, in the lookup table representation, bAr will asymptotically get near to the steady-state algorithm [Tsitsiklis97]:

$$\bar r_{\tau+1} = \bar r_\tau + \Phi' D \left( T^{(\lambda)}(\Phi \bar r_\tau) - \Phi \bar r_\tau \right). \tag{28}$$

Note that the original algorithm has a diminishing learning rate; (28) replaces that term by a constant factor. The connection between the temporal TD operator and LSPE ($\gamma = 1$) is also direct:

$$r_{t+1} = D_t^{-1} \left( (A_t + D_t)\, r_t + b_t \right) = r_t + D_t^{-1} D_t \left( T_t^{(\lambda)} r_t - r_t \right) = r_t + \left( T_t^{(\lambda)} r_t - r_t \right).$$

Therefore, LSPE ($\gamma = 1$) is an algorithm that directly applies the temporal TD operator to the current value of $r$; it is an iterative method for solving the linear system of equations $J = T_t^{(\lambda)} J$.

6.2. Convergence 

The approximate preconditioned algorithms can all be seen as linear Robbins-Monro algorithms with no diminishing learning rate. In traditional Robbins-Monro algorithms, there is a diminishing learning rate that restrains the long-term effects of noise. However, if the stochastic iteration matrix has some special structure and the noise satisfies some conditions, the diminishing learning rate is no longer needed.

The following lemma is a variant of proposition 3.1 in [Bertsekas04] . 

Lemma 1. Assume $[Z_t]_{n \times n}$, $[r_t]_{n \times 1}$ and $[v_t]_{n \times 1}$ are all stochastic. Let us consider the following algorithm:

$$r_{t+1} = (Z_t r_t + v_t) + e_{t+1}, \tag{29}$$

with the following conditions:

(a) $Z_t \to Z$, and $v_t \to v$ w.p.1.

(b) $\max_{1 \le l \le n} (|\sigma_l(Z)|) < 1$, where $\sigma_l(Z)$ is the $l$th eigenvalue of matrix $Z$ and $|\cdot|$ is the modulus operator.

In [Bertsekas04], it is established that, when $e_{t+1} = 0$, (29) converges with probability one as long as (a) and (b) hold.
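The role of conditions (a) and (b) can be illustrated with a minimal sketch of iteration (29) with $e_{t+1} = 0$. The matrices below are hypothetical, and the perturbations of $Z_t$ and $v_t$ are deterministic and decay like $1/(t+1)$, which is only a stand-in for estimates that converge with probability one.

```python
import numpy as np

Z = np.array([[0.5, 0.2],
              [0.1, 0.4]])             # limit matrix, eigenvalues 0.6 and 0.3
v = np.array([1.0, 2.0])               # limit vector
E = 0.3 * np.ones((2, 2))              # direction of the vanishing error in Z_t
w = np.array([0.5, -0.5])              # direction of the vanishing error in v_t

r = np.zeros(2)
for t in range(5000):
    Z_t = Z + E / (t + 1)              # Z_t -> Z, condition (a)
    v_t = v + w / (t + 1)              # v_t -> v, condition (a)
    r = Z_t @ r + v_t                  # iteration (29) without the noise term

r_star = np.linalg.solve(np.eye(2) - Z, v)   # fixed point of r = Z r + v
rho = np.max(np.abs(np.linalg.eigvals(Z)))   # spectral radius, condition (b)
print(r, r_star, rho)
```

No diminishing learning rate appears in the loop: the contraction of the limit matrix is what damps the effect of the early, inaccurate $Z_t$ and $v_t$.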

The following theorem gives the convergence of the class of preconditioned learning algorithms. According to Lemma 1, we just need to verify that each algorithm falls into the form of (29) and that conditions (a) and (b) hold, as well as $e = 0$. These facts are evident from the proof of Theorem 3. The argument can be easily extended to linear function approximation; therefore, we omit the proof here.

Theorem 2. For any initial value, under assumptions similar to those in [Tsitsiklis97][Tadic01][Nedic03][Bertsekas04], the algorithms of bAr, approximate Jacobi iteration, approximate Gauss-Seidel and approximate SOR ($\omega \in (0, 2)$) converge to $r^* = -A^{-1} b$ with probability one.
6.3. Convergence rates

The definition of the asymptotic convergence rate of (29) is similar to that of the deterministic iteration [Varga00]. According to (29) and Lemma 6.7 of [Bertsekas96NDP], we have

$$\|r_{t+1} - r^*\| = \|Z_t r_t + v_t - r^*\| = \|Z (r_t - r^*) + (Z_t - Z)\, r_t + v_t - v\| \le \|Z (r_t - r^*)\| + o(u^t)\, \|r_t\| + o(u^t),$$

where $u \in (0, 1)$. Therefore, $-\ln(\rho(Z))$ can evaluate the asymptotic convergence rate of the stochastic iteration (29). As $Z$ is not known for the algorithms, $-\ln(\rho(Z_t))$ could be used as the approximate convergence rate.

The following theorem compares the convergence rates of the approximate preconditioned algorithms. The convergence rate of approximate SOR depends on $\omega$, and here we consider only $\omega = 1$. For simplicity, bAr is considered with 1 as the learning rate.

Theorem 3. Assume the Markov chain is ergodic. In the lookup table representation, each of the following inequalities holds with probability one:

$$0 < \rho(Z_t^{GS}) < \rho(Z_t^{Jac}) \overset{1^\circ}{\le} \rho(Z_t^{LSPE}) \overset{2^\circ}{\le} \rho(Z_t^{bAr}) < 1,$$

where '=' holds at $1^\circ$ when there are no self-transitions in the chain; '=' holds at $2^\circ$ when there is only one state.

The theorem states that, with probability one, approximate Gauss-Seidel iteration is faster than approximate Jacobi iteration; approximate Jacobi iteration is faster than LSPE ($\gamma = 1$); and LSPE ($\gamma = 1$) is faster than bAr ($\gamma = 1$).
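The ordering of the spectral radii can be examined on the limiting iteration matrices with a minimal sketch. It assumes $\lambda = 0$, the lookup table representation, a discount factor of 0.9 and a hypothetical 3-state chain with self-transitions; the chain and the helper names are illustrative, not part of the proof.

```python
import numpy as np

def stationary_distribution(P):
    """Solve d' P = d' together with sum(d) = 1 by least squares."""
    n = P.shape[0]
    M = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    d, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return d

def spectral_radius(Z):
    return np.max(np.abs(np.linalg.eigvals(Z)))

alpha = 0.9
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])          # hypothetical chain with self-transitions
I = np.eye(3)
D = np.diag(stationary_distribution(P))
G = np.diag(np.diag(P))                   # diagonal part of P
B = I - alpha * P

Z_bar = I + D @ (alpha * P - I)                 # limit for bAr with step-size 1
Z_lspe = alpha * P                              # limit for LSPE with step-size 1
Z_jac = I - np.linalg.solve(I - alpha * G, B)   # limit for approximate Jacobi
Z_gs = I - np.linalg.solve(np.tril(B), B)       # limit for approximate Gauss-Seidel

rho = {name: spectral_radius(Z) for name, Z in
       [("GS", Z_gs), ("Jac", Z_jac), ("LSPE", Z_lspe), ("bAr", Z_bar)]}
print(rho)
```

Since a smaller spectral radius gives a larger $-\ln(\rho(Z))$, the printed values can be read directly as a ranking of asymptotic convergence rates for this toy chain.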

Proof. $\rho(Z^{LSPE}) \le \rho(Z^{bAr}) < 1$. First, $\rho(Z^{LSPE}) \le \|Z^{LSPE}\|_\infty = \alpha$. At the same time, $\alpha$ is an eigenvalue of $Z^{LSPE}$; therefore, $\rho(Z^{LSPE}) = \alpha$. According to the Perron-Frobenius theorem [Varga00], there exists a positive real eigenvalue $\sigma_0$, such that

$$\sigma_0 = \rho(Z^{bAr}) \ge \min_i \sum_j Z^{bAr}_{ij} = \min_i \{ 1 - (1 - \alpha)\, d_i \} \ge \alpha. \tag{30}$$

In the third relation, the equality holds only when there is only one state. Furthermore, $\sigma_0 \le \|Z^{bAr}\|_\infty < 1$; therefore, we have $\rho(Z^{LSPE}) \le \rho(Z^{bAr}) < 1$.

$\rho(Z^{Jac}) \le \rho(Z^{LSPE})$. According to (26) and Gerschgorin's theorem [Varga00], we have

$$\rho(Z^{Jac}) \le \max_{i = 1, \ldots, N} \left\{ \sum_{l \ne i} \frac{\alpha p_{il}}{1 - \alpha p_{ii}} \right\} = \max_{i = 1, \ldots, N} \left\{ \frac{\alpha (1 - p_{ii})}{1 - \alpha p_{ii}} \right\} \le \alpha. \tag{31}$$

$0 < \rho(Z^{GS}) < \rho(Z^{Jac})$. This is true according to Stein and Rosenberg's theorem [Varga00]. □